Skip to content

Conversation

@072020127
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • Implemented logic for storing and reading persistent MapSpecs
  • Added retrieval of the current MapSpec
  • Implemented comparison logic for Map structure
  • Added unit tests and e2e test programs

Which issue(s) this PR fixes:

Fixes #1409

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @072020127, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature enabling seamless, traffic-undisrupted upgrades for the kmesh-daemon. It achieves this by persisting BPF map specifications, comparing them during daemon restarts, and intelligently migrating map data to new versions. This ensures that ongoing network traffic is not interrupted when the daemon is updated.

Highlights

  • Persistent BPF MapSpecs: Implemented logic for storing and reading persistent BPF MapSpecs, which are crucial for maintaining state across kmesh-daemon upgrades.
  • MapSpec Comparison Logic: Added retrieval of the current MapSpec and implemented robust comparison logic for BPF Map structures to detect changes between versions.
  • Seamless Daemon Upgrades: Introduced an 'Update' mode for kmesh-daemon, allowing BPF programs and maps to be updated without traffic disruption by migrating existing map data.
  • Comprehensive Testing: Included new unit tests for MapSpec loading and comparison, as well as an end-to-end test to validate the traffic-undisrupted upgrade process.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


BPF maps persist, Upgrades flow, no traffic lost, Code ensures smooth ride.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant feature for non-disruptive upgrades of the kmesh-daemon. The core logic involves persisting BPF map specifications (MapSpecs), comparing them across versions, and migrating maps when their structure changes. The implementation adds new files for handling BPF updates and loading MapSpecs, along with extensive unit and e2e tests.

My review focuses on the new update logic. I've found a few critical issues related to error handling that could lead to panics, and a bug where changes in struct bitfield sizes are not detected, potentially causing data corruption. I've also identified areas with unhandled errors and code duplication that should be addressed to improve robustness and maintainability. The suggestions provided aim to fix these issues and refactor the code for better clarity.

Comment on lines 275 to 264
if map_old.Offset != uint32(map_new.Offset) {
log.Info(map_old.BitfieldSize, uint32(map_new.BitfieldSize))
diff.OffsetChanged = true
break
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The BitfieldSize of a struct member is not being compared. This is a critical bug, as a change in bitfield size alters the struct layout and should trigger a map migration to avoid data corruption. The existing log.Info call should be replaced with a proper comparison.

  if map_old.Offset != uint32(map_new.Offset) || map_old.BitfieldSize != uint32(map_new.BitfieldSize) {
   diff.OffsetChanged = true
   break
  }

pkg/bpf/bpf.go Outdated
Comment on lines 235 to 236
mapspec, _ := restart.LoadCompileTimeSpecs(config)
restart.SnapshotSpecsByPkg(mapspec)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The error returned by restart.LoadCompileTimeSpecs is ignored. This could lead to restart.SnapshotSpecsByPkg being called with a nil mapspec, which might cause a panic. The error should be checked and handled to prevent this.

Suggested change
mapspec, _ := restart.LoadCompileTimeSpecs(config)
restart.SnapshotSpecsByPkg(mapspec)
mapspec, err := restart.LoadCompileTimeSpecs(config)
if err != nil {
log.Errorf("failed to load compile time specs: %v", err)
return nil
}
restart.SnapshotSpecsByPkg(mapspec)

Comment on lines 66 to 91
specsByPkg, err := LoadCompileTimeSpecs(config)
if specsByPkg == nil {
log.Errorf("load oldSpecsByPkg failed")
return nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The error returned by LoadCompileTimeSpecs is not handled. If an error occurs, specsByPkg could be nil and the function might proceed, potentially causing a panic later. It's crucial to check and handle this error. Also, the log message on line 68 refers to oldSpecsByPkg which seems to be a typo.

Suggested change
specsByPkg, err := LoadCompileTimeSpecs(config)
if specsByPkg == nil {
log.Errorf("load oldSpecsByPkg failed")
return nil
}
specsByPkg, err := LoadCompileTimeSpecs(config)
if err != nil {
log.Errorf("load compile-time map specs failed: %v", err)
return nil
}

Comment on lines 138 to 181
func migrateMap(
oldMapSpec *PersistedMapSpec,
newMapSpec *ebpf.MapSpec,
pkgName, mapName, pinPath string,
) (*ebpf.Map, error) {
if oldMapSpec == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
if oldMapSpec.Type != newMapSpec.Type.String() ||
oldMapSpec.KeySize != newMapSpec.KeySize ||
oldMapSpec.ValueSize != newMapSpec.ValueSize ||
oldMapSpec.MaxEntries != newMapSpec.MaxEntries {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}

oldKeyStruct := oldMapSpec.KeyInfo
newKeyStruct, okNewKey := newMapSpec.Key.(*btf.Struct)
if okNewKey {
diff := diffStructInfoAgainstBTF(oldKeyStruct, newKeyStruct, make(map[string]bool))
if diff.Added || diff.Removed || diff.TypeChanged ||
diff.OffsetChanged || diff.NestedChanged {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
if newMapSpec.Key == nil {
if len(oldKeyStruct.Members) != 0 || oldKeyStruct.Name != "" {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
newKeyTypeName := newMapSpec.Key.TypeName()
if len(oldKeyStruct.Members) != 0 || newKeyTypeName != oldKeyStruct.Name {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
}
}

oldValueStruct := oldMapSpec.ValueInfo
newValueStruct, okNewValue := newMapSpec.Value.(*btf.Struct)
if okNewValue {
diff := diffStructInfoAgainstBTF(oldValueStruct, newValueStruct, make(map[string]bool))
if diff.Added || diff.Removed || diff.TypeChanged ||
diff.OffsetChanged || diff.NestedChanged {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
log.Info(diff.Added ,diff.Removed,diff.TypeChanged ,
diff.OffsetChanged,diff.NestedChanged)
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
if newMapSpec.Value == nil {
if len(oldValueStruct.Members) != 0 || oldValueStruct.Name != "" {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
} else {
newValTypeName := newMapSpec.Value.TypeName()
if len(oldValueStruct.Members) != 0 || newValTypeName != oldValueStruct.Name {
oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{})
if err == nil {
return createEmptyMap(newMapSpec, pinPath, mapName, oldMap)
} else {
return createEmptyMap(newMapSpec, pinPath, mapName, nil)
}
}
}
}
log.Info("pass all check")
return nil, nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The migrateMap function has high complexity and significant code duplication. The logic to decide whether to recreate a map is repeated in multiple branches. This makes the code hard to read and maintain. Consider refactoring this function to extract the map recreation logic into a helper function and simplify the conditional checks.

@codecov
Copy link

codecov bot commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 0.43573% with 457 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.84%. Comparing base (5951594) to head (e907308).
⚠️ Report is 28 commits behind head on main.

Files with missing lines Patch % Lines
pkg/bpf/restart/bpf_update.go 0.00% 325 Missing ⚠️
pkg/bpf/restart/new_version_mapspec_loader.go 0.00% 116 Missing ⚠️
pkg/bpf/bpf.go 15.38% 9 Missing and 2 partials ⚠️
pkg/bpf/ads/sock_connection.go 0.00% 1 Missing ⚠️
pkg/bpf/ads/sock_ops.go 0.00% 1 Missing ⚠️
pkg/bpf/workload/skb.go 0.00% 1 Missing ⚠️
pkg/bpf/workload/sock_connection.go 0.00% 1 Missing ⚠️
pkg/bpf/workload/sock_ops.go 0.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.43%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Files with missing lines Coverage Δ
pkg/bpf/ads/sock_connection.go 0.00% <0.00%> (ø)
pkg/bpf/ads/sock_ops.go 0.00% <0.00%> (ø)
pkg/bpf/workload/skb.go 0.00% <0.00%> (ø)
pkg/bpf/workload/sock_connection.go 0.00% <0.00%> (ø)
pkg/bpf/workload/sock_ops.go 0.00% <0.00%> (ø)
pkg/bpf/bpf.go 39.63% <15.38%> (-0.79%) ⬇️
pkg/bpf/restart/new_version_mapspec_loader.go 0.00% <0.00%> (ø)
pkg/bpf/restart/bpf_update.go 0.00% <0.00%> (ø)

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d9cfcd...e907308. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned this is a very big change, could introduce regression. Can you please add a feature flag, we can turn on until we have enough tests and documents


// LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages.
// Returns a nested map: packageName to mapName to *ebpf.MapSpec
func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by CompileTime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is to describe that these specs are fixed when the binary is compiled (via bpf2go), as opposed to the specs loaded from the persisted on-disk snapshot. Maybe LoadEmbeddedSpecs will be clearer to reflect its role?

} else {
specs["KmeshCgroupSockWorkload"] = coll.Maps
}
if coll, err := dualengine.LoadKmeshCgroupSockWorkloadCompat(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only call either LoadKmeshCgroupSockWorkload or LoadKmeshCgroupSockWorkloadCompat

please check

	if helper.KernelVersionLowerThan5_13() {
		spec, err = bpf2go.LoadKmeshSockopsCompat()
	} else {
		spec, err = bpf2go.LoadKmeshSockops()
	}


// LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages.
// Returns a nested map: packageName to mapName to *ebpf.MapSpec
func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering how we can not miss one map, even after we add a new map later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a go file for gen-bpf_stpecs.go in the/back directory, which can automatically generate LoadCompileTimeSpecs by analyzing the bpf2Go.go file. If there are new maps or progs in the future, it can be easily kept synchronized.

func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) {
specs := make(map[string]map[string]*ebpf.MapSpec)

if config.KernelNativeEnabled() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the tag //go:build enhanced, it is for kernel native mode, so we donot need the below else branch

pkg/bpf/bpf.go Outdated
return nil
}

mapspec, err := restart.LoadCompileTimeSpecs(config)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mapspec, err := restart.LoadCompileTimeSpecs(config)
mapSpecs, err := restart.LoadCompileTimeSpecs(config)

// It will migrate any BPF maps whose on‑disk pin already exists but whose
// compiled MapSpec has changed
func UpdateMapHandler(versionMap *ebpf.Map, kmBpfPath string, config *options.BpfConfig) *ebpf.Map {
persistedSpecs, err := LoadPersistedSnapshot()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persistedSpecs can be nil, which cause panic later, please check all

log.Warnf("failed to remove old map pinpath: %v (continuing)", err)
}
case hasNew && !hasOld:
if _, err := migrateMap(nil, newSpec, pkgName, mapName, pinPath); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer directly use call the map create function

return versionMap
}

func updateVersionInfo(versionMap *ebpf.Map) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can share with that one storeVersionInfo in pkg/bpf/bpf.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to import package restart in pkg/bpf/bpf.go. If i call storeVersionInfo in pkg/bpf/bpf.go, this will result in recursive import.

}

// patchKmesh applies a strategic merge patch to the Kmesh DaemonSet and waits for rollout completion.
func patchKmesh_upgarde(t framework.TestContext, patchData string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo upgrade

func upgradeKmesh(t framework.TestContext) {
newImage := os.Getenv("KMESH_UPGRADE_IMAGE")
if newImage == "" {
newImage = "localhost:5000/kmesh"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnot this same image with old daemonset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attached a new image which is build in run_test.sh set_daemonupgarde_testcase_image() to KMESH_UPGRADE_IMAGE. But E2e test istio 1.23~1.25 report no space left in device. I think the reason is there are two images existing ((localhost:5000/kmesh) and localhost:5000/kmesh/test-upgrade-map-change)

@kmesh-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LiZhenCheng9527
Copy link
Contributor

We update configuration of e2e to fix this errors. Please rebase and then resubmit the code.

Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
@kmesh-bot
Copy link
Collaborator

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[OSPP 2025] Kmesh-daemon upgrades traffic without disruption

4 participants